The majority of the compute (and thus its infrastructure) runs on Spark
Apache Zeppelin


Generally, these Spark batch workloads run on bare metal or VMs in all environments
Most of the tech stack in the ecosystem can be containerised and managed by a container orchestrator
Motivation To Move Batch Workloads To K8s

Why K8s?

Agility
* Hybrid app support

Cost Optimisation
* Reduced operation costs
* Unified tech stack for both on-prem and cloud

Containerisation
* Portability of the apps
* Better resource isolation
* Dependency management


DBS 


1. Enterprise ecosystem and motivation for Kubernetes (K8s)


2. K8s Scheduling 


3. Desired Batch Scheduling Features 


4. Batch Schedulers on the Table 


5. What We Chose and Why 


6. Gaps and Enterprise Requirements 




Kubernetes Scheduling — Pod Scheduling Context 


[Figure: pod scheduling context. New pods are gated by the PreEnqueue plugins and then go through the scheduling queue. Scheduling Cycle: pick a Pod from the scheduling queue, reserve a Node for the Pod in the cache. Binding Cycle: bind the Pod to the Node.]

* Kubernetes supports custom schedulers
* Default scheduler: extensible framework with 3 stages
  * Pod Queuing: parking pods until certain conditions are met
  * Scheduling: Filter (Pod filtering) and PostFilter (mapping the ideal Node)
  * Binding Cycle: launching and bookkeeping
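
The three stages can be illustrated with a toy model. This is a minimal sketch, not the kube-scheduler implementation: the `Node` and `Pod` classes, the `gated` flag and the "most free CPU wins" selection rule are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: int


@dataclass
class Pod:
    name: str
    cpu: int
    gated: bool = False  # a gated pod stays parked in the queuing stage


def schedule_one(pod, nodes):
    """Return the name of the node the pod is bound to, or None if it stays pending."""
    # Pod queuing: gated pods are not handed to the scheduling cycle.
    if pod.gated:
        return None
    # Scheduling cycle, filter step: keep only nodes with enough free CPU.
    feasible = [n for n in nodes if n.free_cpu >= pod.cpu]
    if not feasible:
        return None
    # Scheduling cycle, node selection (assumed rule): prefer the most free CPU.
    chosen = max(feasible, key=lambda n: n.free_cpu)
    # Reserve the capacity in the local cache, then bind (bookkeeping).
    chosen.free_cpu -= pod.cpu
    return chosen.name


nodes = [Node("node-a", 4), Node("node-b", 8)]
print(schedule_one(Pod("p1", 6), nodes))
print(schedule_one(Pod("p2", 6), nodes))
```

In this toy run the first pod lands on `node-b`, and the second stays pending because no node has 6 CPUs left.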


Kubernetes Scheduling

Pros

Scheduling features
* Supports multiple deployment types, enabling multiple types of applications
* Supports multiple resource types
* Supports labels, taints & tolerations
* Supports bin packing
* Supports dynamic resource allocation
* Supports pod priority & preemption
* Self-healing: node-pressure eviction
* Supports affinity & anti-affinity

Cons

Limits/Quota
* Resource quotas are not part of the default scheduler
* Namespace-level hard enforcement by rejection during submission
* Lack of support for resource sharing between jobs, queues, and namespaces

No First-class Application Concept
* Lack of fine-grained lifecycle management
* Lack of support for frameworks like TensorFlow, PyTorch, MXNet

Scheduling
* Single queue servicing the entire cluster
* Single sorting algorithm: FIFO

Scale and Performance
* Limited scalability in the number of nodes (~7,500 nodes)
* Comparatively slow for large batch workload demands
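
The difference between hard quota enforcement and queue-style enforcement can be sketched as follows. Both classes are hypothetical illustrations, not Kubernetes APIs: `HardQuotaNamespace` mimics rejection at submission time, while `BatchQueue` shows the behaviour a batch scheduler would typically offer instead.

```python
from collections import deque


class HardQuotaNamespace:
    """Hard enforcement: a request that would exceed the limit is rejected."""

    def __init__(self, limit_cpu):
        self.limit_cpu = limit_cpu
        self.used_cpu = 0

    def submit(self, cpu):
        if self.used_cpu + cpu > self.limit_cpu:
            return "rejected"
        self.used_cpu += cpu
        return "accepted"


class BatchQueue:
    """Queue-style enforcement: an over-limit request waits instead of failing."""

    def __init__(self, limit_cpu):
        self.limit_cpu = limit_cpu
        self.used_cpu = 0
        self.pending = deque()

    def submit(self, cpu):
        if self.used_cpu + cpu > self.limit_cpu:
            self.pending.append(cpu)
            return "pending"
        self.used_cpu += cpu
        return "running"

    def release(self, cpu):
        """Free capacity and start waiting requests in FIFO order."""
        self.used_cpu -= cpu
        started = []
        while self.pending and self.used_cpu + self.pending[0] <= self.limit_cpu:
            nxt = self.pending.popleft()
            self.used_cpu += nxt
            started.append(nxt)
        return started
```

With a 10-CPU limit and 8 CPUs in use, a 4-CPU request is rejected outright by the first class, whereas the second keeps it pending and starts it once capacity is released.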
Batch Scheduling Features

* Better Capacity Planning
  * Define capacity in terms of queues/pools with quotas and limits
  * Define queue hierarchy
  * User-based quota
  * Preemption

* Advanced Resource Scheduling Features With Application Awareness
  * Ordering policies: App, Node, Requests
  * Gang scheduling
  * App/Job priority
  * Resource reservation
  * High throughput
  * Topology-based scheduling
  * Reclaim and backfill


[Figure: example tenant queues]
QueueA: Priority 1, Min: 60 cpu, Max: 100 cpu
QueueB: Priority 2, Min: 60 cpu, Max: 60 cpu
QueueC: Priority 1, Min: 20 cpu, Max: 60 cpu
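
One way to read the Min/Max values in the queue example is as a guaranteed share and a hard ceiling. The sketch below assumes that reading; the `Queue` class and its methods are hypothetical, and the priority field is carried but not used because the slide does not say how priorities are ordered.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    priority: int
    min_cpu: int  # assumed meaning: guaranteed share
    max_cpu: int  # assumed meaning: hard ceiling
    used_cpu: int = 0

    def can_admit(self, cpu):
        """A request fits only if usage stays at or below the ceiling."""
        return self.used_cpu + cpu <= self.max_cpu

    def borrowed(self):
        """Usage above the guaranteed share, which could be reclaimed."""
        return max(0, self.used_cpu - self.min_cpu)


queue_a = Queue("QueueA", priority=1, min_cpu=60, max_cpu=100, used_cpu=80)
queue_b = Queue("QueueB", priority=2, min_cpu=60, max_cpu=60)
queue_c = Queue("QueueC", priority=1, min_cpu=20, max_cpu=60)

print(queue_a.borrowed())     # CPUs used beyond the guarantee
print(queue_a.can_admit(30))  # would exceed the 100 cpu ceiling
print(queue_b.can_admit(60))  # Min equals Max, so QueueB cannot borrow
```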


Batch Scheduling Features 


Cloud requirements 
* Scale based on the queue’s capacity, not on unscheduled pods
* Node sorting (bin packing) to avoid unnecessary node provisioning
* Provisioning pods for an app in the same zone
* Ability to manage budgets
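
Node sorting for bin packing can be sketched as "try the fullest node first", so that lightly used nodes stay empty and do not trigger or prolong node provisioning. The dictionary layout and function names below are assumptions for illustration only.

```python
def sort_nodes_bin_packing(nodes):
    """Order nodes so that the ones with the least free CPU are tried first."""
    return sorted(nodes, key=lambda n: n["free_cpu"])


def place(pod_cpu, nodes):
    """Place a pod on the first node, in bin-packing order, that can hold it."""
    for node in sort_nodes_bin_packing(nodes):
        if node["free_cpu"] >= pod_cpu:
            node["free_cpu"] -= pod_cpu
            return node["name"]
    return None


nodes = [{"name": "busy", "free_cpu": 2}, {"name": "empty", "free_cpu": 16}]
print(place(1, nodes))
print(place(1, nodes))
```

Both small pods land on the already busy node, which leaves the empty node untouched and a candidate for scale-down.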


Observability and Troubleshooting 
* Centralized place to monitor current state 
* Trend analysis of utilization
* Troubleshooting and narrowing down to the troublesome app/user/queue/queue hierarchy
* Admin CLIs to troubleshoot
Volcano
Key dates: Created in March 2019; accepted by CNCF as an Incubating project in April 2022
Maintained by: CNCF
Latest release: 1.8.0 (Aug 2023)
Major adopters: [Figure: adopter logos]
Contribution trend: [Figure: commits over time, up to Oct 3, 2023]
Volcano Architecture
CRD + Controllers + Scheduler

[Figure: Volcano architecture diagram: Kubernetes, Volcano Scheduler]

Scheduler: The Volcano Scheduler schedules jobs to the most suitable node based on actions and plugins. Volcano supplements Kubernetes to support multiple scheduling algorithms for jobs.

Controller Manager: Volcano CMs manage the lifecycle of Custom Resource Definitions (CRDs). You can use the Queue CM, PodGroup CM, and VCJob CM.

Admission: Volcano Admission is responsible for CRD API validation.

Vcctl: vcctl is the command-line client for Volcano.


Volcano Architecture

[Figure: Volcano scheduler workflow. Jobs (JobSpec) are cached as JobInfo, reconstructed in the cache by PodGroup. OpenSession is followed by the actions enqueue, allocate, preempt, reclaim and backfill, then CloseSession. ssn.Bind is called if JobReady; ssn.Evict is called if preemptable. Predicate, allocate and preempt are actions, and they are pluggable. Plugins on demand: DRF plugin; Priority plugin (JobOrderFn, TaskOrderFn, PreemptableFn); Gang plugin (JobOrderFn, PreemptableFn, JobReadyFn).]

* Watches for and caches the jobs submitted by the client.
* Opens a session periodically. A scheduling cycle begins.
* Sends jobs that are not scheduled in the cache to the to-be-scheduled queue in the session.
* Traverses all jobs to be scheduled. Executes enqueue, allocate, preempt, reclaim, and backfill actions in the order they are defined, and finds the most suitable node for each job. Binds the job to the node. The specific algorithm logic executed in the action depends on the implementation of each function in the registered plugins.
* Closes this session.
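
The session cycle described above can be sketched as a loop over pluggable actions. This is a simplified illustration and not Volcano's code: the dictionary-based job layout, `run_session`, `allocate` and `job_ready_gang` are hypothetical names that only mirror the ideas of ordered actions, a job-ordering function and a gang-style readiness check.

```python
def job_ready_gang(job, assigned):
    """Gang idea: a job is ready only if enough tasks found a node together."""
    return len(assigned) >= job["min_available"]


def allocate(session):
    """Allocate action: visit jobs by priority, commit only gang-ready jobs."""
    ordered = sorted(session["jobs"], key=lambda j: -j["priority"])  # job order
    for job in ordered:
        free = dict(session["free_cpu"])  # tentative view of the cluster
        assigned = {}
        for task, cpu in job["tasks"].items():
            node = next((n for n, f in free.items() if f >= cpu), None)
            if node is not None:
                free[node] -= cpu
                assigned[task] = node
        if job_ready_gang(job, assigned):
            session["free_cpu"] = free  # commit, analogous to binding
            session["bound"][job["name"]] = assigned
        # Otherwise nothing is committed and the job stays pending.


def run_session(jobs, free_cpu, actions):
    """Open a session, run the configured actions in order, close the session."""
    session = {"jobs": jobs, "free_cpu": dict(free_cpu), "bound": {}}
    for action in actions:
        action(session)
    return session["bound"]


jobs = [
    {"name": "big", "priority": 1, "min_available": 2, "tasks": {"t1": 3, "t2": 3}},
    {"name": "small", "priority": 0, "min_available": 1, "tasks": {"t1": 2}},
]
print(run_session(jobs, {"n1": 4}, [allocate]))
```

Here the higher-priority job cannot place both of its tasks, so it holds no resources, and the smaller job is bound instead. Avoiding partial allocation is the point of gang scheduling.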
Pros

* Supports diverse scheduling algorithms
  * Gang scheduling
  * Fair-share scheduling
  * Queue scheduling
  * Preemption scheduling
  * Topology-based scheduling
  * Reclaim
  * Backfill
  * Resource reservation
  * Supports configuring plugins and actions to apply custom scheduling policies
  * Elastic jobs

* Supports mainstream computing frameworks
  * Spark, TensorFlow, PyTorch, Flink, Argo, MindSpore, PaddlePaddle, Open MPI, Horovod, MXNet, Kubeflow, KubeGene, and Cromwell

* Supports federation in its latest release
Cons

* Architecture extends a particular version of the K8s scheduler and completely replaces the default scheduler
  * New features that K8s supports by default will not be available if scheduling goes through Volcano
  * Not all deployment types supported by K8s are supported by Volcano, and hence it's not a complete replacement for the default scheduler

* Although metrics can be captured in Prometheus, the web-based toolkit isn't suitable for troubleshooting as the technology isn't mature enough


Armada
Key dates: Created in October 2019; CNCF Sandbox in April 2022
Maintained by: CNCF
Latest release: 0.3.92 (Sep 2023)
Major adopters: [Figure: adopter logos]
Contribution trend: [Figure: commits to master, excluding merge commits and bot accounts, Jun 16, 2019 to Oct 3, 2023]
Armada Architecture

[Figure: Armada architecture diagram: events database, database, and executors that lease jobs and report progress]

Armada server: Responsible for accepting jobs from users and deciding in what order, and on which Kubernetes cluster, the jobs should run. Users submit jobs to the Armada server through the armadactl command-line utility or via a gRPC or REST API.

Armada executor: One instance runs in each Kubernetes cluster that Armada is connected to. Each Armada executor instance regularly notifies the server of the amount of spare capacity available and requests jobs to run. Users of Armada never interact with the executor directly.

All state relating to the Armada server is stored in Redis, which may use replication combined with failover for redundancy. Hence, the Armada server is itself stateless and is easily replicated by running multiple independent instances. Both the server and the executors are intended to run in Kubernetes pods.
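
The pull-based interaction between executors and the server can be sketched as a lease call. This is a toy model under stated assumptions: `ArmadaLikeServer` and its methods are hypothetical, jobs are ordered first-come first-served, and state lives in memory rather than in an external store.

```python
from collections import deque


class ArmadaLikeServer:
    """Holds queued jobs; executors pull work by reporting spare capacity."""

    def __init__(self):
        self.queue = deque()
        self.running = {}  # job id -> cluster that leased it

    def submit(self, job_id, cpu):
        self.queue.append((job_id, cpu))

    def lease(self, cluster, spare_cpu):
        """Hand out queued jobs that fit into the capacity the executor reported."""
        leased = []
        remaining = deque()
        while self.queue:
            job_id, cpu = self.queue.popleft()
            if cpu <= spare_cpu:
                spare_cpu -= cpu
                self.running[job_id] = cluster
                leased.append(job_id)
            else:
                remaining.append((job_id, cpu))
        self.queue = remaining
        return leased


server = ArmadaLikeServer()
server.submit("a", 4)
server.submit("b", 8)
server.submit("c", 2)
print(server.lease("cluster-1", spare_cpu=6))
print(server.lease("cluster-2", spare_cpu=8))
```

Because executors ask for work, the server never needs to know node-level details of each cluster. That fits the multi-cluster design described above.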


Armada Overview

Pros

* Primarily designed to
  * Manage compute workloads on tens of thousands of nodes spread across K8s clusters
  * Deliver high throughput, scheduling >1,000 pods per second on average
  * Enqueue tens of thousands of jobs over a few seconds
  * Divide resources fairly between users
  * Provide visibility for users and admins
  * Ensure near-constant uptime

Cons

* Too many layers and components if the workload is not that large
* For advanced scheduling features, it needs to rely on the native scheduler
* No integration with autoscaling
* Eventually consistent architecture
* Minimal observability
Kueue
Key dates: Created in October 2021; CNCF Sandbox in April 2022
Maintained by: CNCF
Number of contributors: 20 (42)
Latest release: 0.4.1 (Aug 2023)
Major adopters: Google Kubernetes Engine
Contribution trend: [Figure: commits to main, excluding merge commits and bot accounts, Oct 31, 2021 to Oct 4, 2023]


Kueue Resource Model

Key Design Principle: Reuse and extend existing APIs to ensure there is no overlap with the scope of existing components

Existing Constructs
* Namespace: Canonical tenant abstraction
* Job: Computation that runs to completion; can have one or more similar or different pods
* Node

New Resource Constructs:
* Queue/LocalQueue: Grouping and managing closely related tenant jobs
* ClusterQueue: Cluster-scoped resource that governs the pool of resources and defines usage limits and the boundaries of fair sharing
* ResourceFlavor: A set of labels that mirrors the labels on the nodes that offer those resources
* Workload: Synonymous with a job that generally comprises one or more pods and runs to completion (e.g. v1.Job)
* Cohort: Group of ClusterQueues that can share resources among themselves

[Figure: Kueue resource model. Namespaces (name: robo-vision, name: robo-actions) each hold batch workloads; a batch workload is abstracted by a Workload and submitted to a LocalQueue, which maps to a ClusterQueue; Nodes carry labels such as spot. Legend: new resource type, existing resource type.]
Kueue Operations

[Figure: Kueue operations flow between the Batch Admin, the Batch User and the cluster]
0. Batch Admin: Create ResourceFlavor(s), ClusterQueue(s), LocalQueue(s)
1. Batch User: Create batch/v1.Job (queue-name: my-queue, suspend: true)
2. Admit Job (based on order, quota etc.)
3. Inject nodeAffinity based on selected flavor and unsuspend the job
4. Create v1.Pod(s)
5. Schedule v1.Pod(s)
6. Provision more nodes
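
Steps 2 and 3 of the flow can be sketched as a small admission loop. This is an illustrative model, not Kueue's controller logic: the dictionary layout, the `admit_jobs` function, the flavor names and the `capacity` label key are all assumptions.

```python
def admit_jobs(jobs, cluster_queue):
    """Admit suspended jobs in submission order while some flavor has quota left."""
    admitted = []
    for job in jobs:
        if not job["suspend"]:
            continue  # already admitted earlier
        for flavor in cluster_queue["flavors"]:
            if flavor["used_cpu"] + job["cpu"] <= flavor["quota_cpu"]:
                flavor["used_cpu"] += job["cpu"]
                # Step 3: steer the pods to matching nodes, then unsuspend.
                job["node_affinity"] = dict(flavor["node_labels"])
                job["suspend"] = False
                admitted.append(job["name"])
                break
    return admitted


cluster_queue = {
    "flavors": [
        {"name": "on-demand", "quota_cpu": 4, "used_cpu": 0,
         "node_labels": {"capacity": "on-demand"}},
        {"name": "spot", "quota_cpu": 8, "used_cpu": 0,
         "node_labels": {"capacity": "spot"}},
    ]
}
jobs = [
    {"name": "j1", "cpu": 4, "suspend": True},
    {"name": "j2", "cpu": 6, "suspend": True},
    {"name": "j3", "cpu": 6, "suspend": True},
]
print(admit_jobs(jobs, cluster_queue))
```

The third job stays suspended because neither flavor has quota left for it. Pod creation and placement are left to other components and are not modelled here.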
Pros

* Designed primarily with both on-prem and cloud in mind; the latter requires provisions to scale differently because resources are elastic and heterogeneous
* Native to Kubernetes and is therefore a complete package; queuing is just an additional controller
* Designed without much overlap with other key components, i.e. Pod scheduling, autoscaling, Job lifecycle management, etc.
* Supports different types of jobs: v1.Job, RayJob, MPIJob, Flux MiniCluster, Python, JobSet, and extensible custom jobs
Cons

* Still in an experimental stage and not GA-ready yet, though it's supported in GKE
* No gain in scheduling performance; clusters of more than 5k nodes will not be addressed by the existing approach
* Troubleshooting is done the K8s way and needs more APIs to make it complete
* A different perspective on queue management that might be complicated for existing YARN users to comprehend
* Considers all Pods of an app/job equal
YuniKorn

Key dates: Introduced as an Apache Incubator project in 2020; Top-Level Project in 2022
Maintained by: Apache Software Foundation
Number of contributors: [not legible]
Original creator: Cloudera and Apple
Latest release: 1.3 (Jun 2023)
Major adopters: [Figure: adopter logos]
Contribution trend: [Figure: commits to master, excluding merge commits and bot accounts, Mar 10, 2019 to Oct 2023]
YuniKorn Architecture

[Figure: YuniKorn architecture: Admission Controller, Container Orchestration Systems, YuniKorn Scheduler Interface, YuniKorn K8s-Shim, Cloud Instances; shown alongside the Kubernetes pod scheduling context (PreEnqueue plugins, Scheduling Cycle, Binding Cycle)]

Architecture
* YuniKorn Scheduler Pod
  * YuniKorn scheduler: Core scheduling decisions
  * YuniKorn Web: For web and REST interface
* Admission Controller Pod (optional)
  * AdmissionController: Filtering Namespace requests, user mapping or limiting

Scheduler
* Extends K8s scheduler
* 4 stages are mainly extended
  * PreEnqueue
  * PreFilter
  * Filter
  * PostBind


YuniKorn Overview

Pros
* The founders were predominantly YARN developers. As such, the YuniKorn scheduler feature set almost matches YARN's and in certain cases is even more optimized. Highlights include:
  * User/group level resource mapping in Queue hierarchy
  * Sorting at various levels: App, Node and Resource weighting
  * Gang scheduling
* Core design principle has been to extend the K8s default scheduler, which enables it to be applied at a cluster level
* Better troubleshooting with Web UI and REST-based “activity dump”
* Better options to map Namespace to Queues
* Better support for public cloud, as it supports bin packing and enables queue-based scaling support (upcoming version for plugin model)
* Performance is more optimised

Cons
* Scaling out for larger clusters (>5k) is yet to be addressed
* The designs of native K8s and YuniKorn seem to be diverging; long-term sustainability remains to be seen
* Requires support for cloud-centric features like budgets; zone-specific placement of an App’s pods needs to be considered
* A YuniKorn build mapped to the K8s version in use needs to be created to ensure all functionality stays intact
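
Resource limits in a queue hierarchy can be sketched as a check that walks from a leaf queue up to the root. This is a simplified illustration rather than YuniKorn's implementation: `QueueNode`, its methods and the queue names are hypothetical.

```python
class QueueNode:
    """A queue in a hierarchy; an allocation must fit at every ancestor level."""

    def __init__(self, name, max_cpu, parent=None):
        self.name = name
        self.max_cpu = max_cpu
        self.parent = parent
        self.used_cpu = 0

    def path(self):
        """Dotted path from the root, for example root.analytics.etl."""
        if self.parent is None:
            return self.name
        return f"{self.parent.path()}.{self.name}"

    def try_allocate(self, cpu):
        # First pass: verify the request fits in this queue and all ancestors.
        node = self
        while node is not None:
            if node.used_cpu + cpu > node.max_cpu:
                return False
            node = node.parent
        # Second pass: record the usage at every level.
        node = self
        while node is not None:
            node.used_cpu += cpu
            node = node.parent
        return True


root = QueueNode("root", 100)
analytics = QueueNode("analytics", 60, parent=root)
etl = QueueNode("etl", 50, parent=analytics)
adhoc = QueueNode("adhoc", 30, parent=analytics)

print(etl.try_allocate(50))
print(adhoc.try_allocate(20))  # fits in adhoc but not in its parent
print(adhoc.try_allocate(10))
```

The second request is refused even though the leaf queue has room, because the parent queue would exceed its own limit.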
What We Chose and Why

* Meets most of the batch scheduling requirements
  * Able to scale out gracefully in the cloud
  * Fine-grained capacity management
  * Optimised observability and troubleshooting
  * Application-aware resource scheduling
* Proven solution at large scale and adopted by large enterprises
* Continuous contributions over a period
* Active community, which is a mix of organisations
* Does not deviate much from K8s scheduling and is easily maintainable in the future
  * Single unified scheduler for all K8s workloads
* Sufficient to meet our scale and performant enough to schedule batch workloads
Gaps and Enterprise Requirements

We look forward to the following features and aspire to contribute:
* Cloud requirements
  * Tenant-based budget support
  * Deploying a job in a single zone to remove shuffle costs
* Gaps in observability compared to YARN that still need improvement:
  * Apps pending in queues and the reason for not being launched
  * Consumption at user/group level within the queue
  * App-level metrics:
    * Pending requests
    * Link to the application UI if available
    * Logs (if possible?)
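
The single-zone requirement can be sketched as choosing one zone that can hold every pod of a job. This is a hypothetical illustration of the wished-for feature, not an existing scheduler API: `pick_zone` works on aggregated free CPU per zone, ignores per-node fragmentation, and assumes a "tightest fit" rule.

```python
def pick_zone(pod_cpus, free_cpu_by_zone):
    """Return a zone that fits the whole job, preferring the tightest fit."""
    total = sum(pod_cpus)
    candidates = [z for z, free in free_cpu_by_zone.items() if free >= total]
    if not candidates:
        return None  # no single zone can host the job
    return min(candidates, key=lambda z: free_cpu_by_zone[z])


print(pick_zone([4, 4, 2], {"zone-a": 8, "zone-b": 16, "zone-c": 12}))
```

Keeping all pods of a job in one zone means shuffle traffic stays within that zone.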




Thank You! 


